
prover: GPU compression path + plumbed (gated) GPU aggregation #3041

Open
gbotrel wants to merge 4 commits into main from prover/gpu-compression


gbotrel commented May 8, 2026

Summary

Wires GPU acceleration for the compression proof (data-availability-v2)
into the production prover and plumbs in — but does not enable — the GPU
path for the aggregation proof.

  • Compression: auto-enabled whenever a CUDA device is reachable. Wall-clock on
    the reference host drops from ~4:40 (CPU) to ~2:10 per proof (3-proof
    batch average 2:10.19 on AWS g7e.8xlarge with one RTX PRO 6000 Blackwell).
  • Aggregation: GPU code is wired (gpu/plonk2 PI/BW6/BN254, gpu/vortex PI
    MiMC + ring-SIS, gpu/quotient) but off by default. Operators must set
    LINEA_PROVER_GPU_AGGREGATION=1 to opt in. Production should leave it off
    for now.
  • Controller: when launched on a GPU host, only compression jobs are
    accepted. Execution / aggregation / invalidity files are ignored even if
    the corresponding Enable* toggles are on, so a GPU host never falls back
    to a slow CPU path for non-compression work.

⚠️ DevOps must read this

1. Build flags

The CPU binary is unchanged. The GPU binary requires the cuda build tag and
links against the static libgnark_gpu.a produced from prover/gpu/cuda.

# CPU only
make bin/prover                    # GO_BUILD_TAGS=debug, no CUDA dependency

# GPU
make bin/prover-cuda               # GO_BUILD_TAGS=debug,cuda; needs libgnark_gpu.a
# equivalently:
make GO_BUILD_TAGS=debug,cuda bin/prover

bin/prover-cuda is a new make target. The static library
prover/gpu/cuda/build/libgnark_gpu.a must already exist before linking; the
CMake build is unchanged from the existing gpu/cuda source tree.
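As a minimal sketch of how a `cuda` build tag keeps the default binary free of CUDA dependencies, the file below compiles only when the tag is absent. The `hasDevice` name is hypothetical (the PR's real check is `gpu.HasDevice()`, whose implementation is not shown here); a sibling file guarded by `//go:build cuda` would hold the cgo code that links `libgnark_gpu.a`.

```go
//go:build !cuda

// Hypothetical CPU-only stub: compiled only when the "cuda" build tag is
// absent (e.g. `make bin/prover`). A sibling file guarded by
// `//go:build cuda` would provide the real, cgo-backed implementation.
package main

import "fmt"

// hasDevice always reports false in a non-CUDA build, so callers fall back
// to the CPU prover without any runtime CUDA probing.
func hasDevice() bool { return false }

func main() {
	fmt.Println("GPU available:", hasDevice())
}
```

Built without tags this prints `GPU available: false`; with `-tags cuda` the file is excluded, so the build only succeeds if the tagged twin exists.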

2. Host class per job type

| Job type | Host class | Activation |
|---|---|---|
| Compression (DA-v2) | GPU (g7e.8xlarge, 1× RTX PRO 6000 Blackwell) | Auto-detected; no env var needed |
| Execution / Limitless | CPU (existing class, unchanged) | Unchanged |
| Aggregation | CPU (existing class, unchanged) | Stays on CPU; do not set LINEA_PROVER_GPU_AGGREGATION |
| Invalidity | CPU (existing class, unchanged) | Unchanged |

3. Controller behavior on GPU hosts

cmd/controller checks gpu.HasDevice() at start. When true:

  • It accepts only *-getZkBlobCompressionProof.json jobs.
  • EnableExecution, EnableAggregation, EnableInvalidity from the config
    are ignored even if they are true.

This is intentional: running the CPU paths on a GPU host with 32 vCPUs would
be much slower than dispatching them to the existing CPU pool.

CPU controllers are unchanged: same Enable* semantics as today.
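The dispatch rule can be sketched as follows, assuming the controller matches job files by suffix. `acceptJob`, its `hasGPU` flag (standing in for `gpu.HasDevice()`), the `cpuPolicy` callback (standing in for today's Enable* logic) and the example file names are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Job-file suffix for compression, as named in the PR description.
// Matching by suffix is an assumption made for this sketch.
const compressionSuffix = "-getZkBlobCompressionProof.json"

// acceptJob sketches the documented GPU-host policy. hasGPU stands in for
// gpu.HasDevice(); cpuPolicy stands in for the unchanged Enable*-based
// logic of CPU controllers (both are hypothetical parameters).
func acceptJob(fileName string, hasGPU bool, cpuPolicy func(string) bool) bool {
	if hasGPU {
		// GPU host: compression only, whatever the Enable* toggles say.
		return strings.HasSuffix(fileName, compressionSuffix)
	}
	return cpuPolicy(fileName)
}

func main() {
	acceptAll := func(string) bool { return true } // e.g. every Enable* set to true
	for _, f := range []string{
		"30388561-30389025-getZkBlobCompressionProof.json", // made-up name
		"30388561-30389025-getZkAggregatedProof.json",      // made-up name
	} {
		fmt.Printf("%s: gpu=%v cpu=%v\n", f,
			acceptJob(f, true, acceptAll), acceptJob(f, false, acceptAll))
	}
}
```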

4. Required runtime env vars

Compression (GPU host):

GOMEMLIMIT=180GiB
GOGC=75
# nothing else; GPU is auto-detected

These two values were set for the reference run; they keep peak memory
usage at ~200 GiB on a 249 GiB host without thrashing the GC.
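To confirm at startup that these environment variables were actually picked up, a process can read its effective settings through `runtime/debug`. A minimal sketch with hypothetical function names; `applyReferenceGC` only shows the in-process equivalent of the two variables and is not something the PR does.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// Values used for the reference run (GOMEMLIMIT=180GiB, GOGC=75).
const (
	referenceMemLimit  int64 = 180 << 30 // 180 GiB in bytes
	referenceGCPercent       = 75
)

// applyReferenceGC sets the same values from inside the process. The PR
// configures them via environment variables; this is only an equivalent.
func applyReferenceGC() {
	debug.SetMemoryLimit(referenceMemLimit)
	debug.SetGCPercent(referenceGCPercent)
}

// currentGC reads the effective settings back without changing them.
func currentGC() (memLimit int64, gcPercent int) {
	memLimit = debug.SetMemoryLimit(-1) // a negative value only queries
	gcPercent = debug.SetGCPercent(100) // set temporarily to read the old value...
	debug.SetGCPercent(gcPercent)       // ...then restore it
	return memLimit, gcPercent
}

func main() {
	limit, pct := currentGC()
	fmt.Printf("effective GOMEMLIMIT=%d GiB, GOGC=%d\n", limit>>30, pct)
}
```

`debug.SetMemoryLimit(-1)` is documented as a query that leaves the limit unchanged; `SetGCPercent` has no query form, hence the set-and-restore.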

Aggregation (CPU host) — unchanged from origin/main today.

5. Required runtime resources

  • GPU host (compression): 1× RTX PRO 6000 Blackwell, 96 GiB VRAM. Peak
    VRAM usage observed: ~80 GiB. Do not schedule another GPU process on
    the same card while a compression proof is in flight.
  • Disk: the prover-assets 7.1.0/data-availability-v2/ directory must be
    present on the host. The canonical SRS is read once per process and
    benefits substantially from being in OS page cache; a freshly-booted host
    pays ~2 min of cold-cache cost on the first proof. Subsequent runs
    match the timings in the table below.
  • Memory: peak host RSS ~210 GiB (large because GPU pinned-memory
    staging buffers are reused across rounds and the gnark Go heap is
    intentionally large under GOMEMLIMIT=180GiB).

6. Compression reference numbers (3 sorted requests)

| Run | Block range | Wall time | Setup load | Solver | GPU prover | Max RSS | CPU |
|---|---|---|---|---|---|---|---|
| 1 | 30388561-30389025 | 2:10.41 | 16.81s | 33.12s | 1:43.61 | 200.7 GiB | 285% |
| 2 | 30389026-30389504 | 2:10.21 | 16.86s | 33.11s | 1:43.31 | 200.7 GiB | 285% |
| 3 | 30389505-30390023 | 2:09.96 | 16.90s | 33.08s | 1:43.12 | 200.6 GiB | 286% |

Average wall time: 2:10.19

Per-phase decomposition (from prover logs): solve 33 s → init GPU
instance ~19 s → MSM commit L,R,O ~4 s → build/iFFT/commit Z ~8 s →
quotient GPU ~25 s → MSM h₁,h₂,h₃ ~4 s → eval+linearize+open Z ~7 s →
batch opening ~4 s.

Raw artifacts are under prover/reference-benchmarks/results/2026-05-08-g7e-8xlarge-gpu-compression-final/.

7. Proof-flow summary

Compression (GPU host)
  controller picks up *getZkBlobCompressionProof.json*
  └─ bin/prover prove ...
     ├─ LoadSetup (canonical SRS only — GPU path)         ~17s
     ├─ Solver (gnark constraint system)                  ~33s
     └─ gpu/plonk2/bls12377.GPUProve (BLS12-377 PlonK)   ~1:43

Aggregation (CPU host, unchanged)
  controller picks up *getZkAggregatedProof.json*
  └─ bin/prover prove ...
     ├─ makePiProof  → PI wizard + BLS12-377 PlonK   (CPU)
     ├─ makeBw6Proof → BW6-761 PlonK                 (CPU)
     └─ makeBn254Proof → BN254 emulation PlonK       (CPU)

Aggregation (GPU host, opt-in only — DO NOT ENABLE TODAY)
  Same flow + LINEA_PROVER_GPU_AGGREGATION=1 → all three Plonk phases on
  gpu/plonk2; PI Vortex MiMC and ring-SIS on gpu/vortex; quotient
  evaluation on gpu/quotient.

8. Rollback

The compression GPU path can be disabled at runtime by deploying the
non-cuda bin/prover (or by hiding the GPU device from the prover process,
e.g. CUDA_VISIBLE_DEVICES=""). No code change required — the prover falls
back to gnark's CPU PlonK prover.

The aggregation GPU path is off by default; nothing to roll back unless an
operator explicitly set LINEA_PROVER_GPU_AGGREGATION=1 (just unset it).
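If the prover is launched from a Go supervisor, the `CUDA_VISIBLE_DEVICES=""` rollback can be applied to the child's environment. A sketch with a hypothetical `withoutGPU` helper; it relies on the behavior stated above, namely that hiding the device makes detection fail so the prover takes the CPU path.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// withoutGPU returns a copy of env in which CUDA_VISIBLE_DEVICES is set to
// the empty string, hiding all CUDA devices from the child process. Any
// pre-existing CUDA_VISIBLE_DEVICES entry is dropped first so the result
// contains a single, unambiguous value.
func withoutGPU(env []string) []string {
	out := make([]string, 0, len(env)+1)
	for _, kv := range env {
		if !strings.HasPrefix(kv, "CUDA_VISIBLE_DEVICES=") {
			out = append(out, kv)
		}
	}
	return append(out, "CUDA_VISIBLE_DEVICES=")
}

func main() {
	env := withoutGPU(os.Environ())
	fmt.Println(env[len(env)-1]) // CUDA_VISIBLE_DEVICES=
	// e.g. cmd := exec.Command("bin/prover", "prove", ...); cmd.Env = env
}
```

`os/exec` would also honor the last of several duplicate keys, but keeping a single entry makes the child environment easier to audit.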

Test plan

  • go build ./... (CPU)
  • go build -tags cuda,debug ./... (GPU)
  • go test ./gpu/plonk2/... -tags cuda,debug (per-curve correctness vs gnark CPU reference)
  • go test ./gpu -tags cuda,debug (device singleton)
  • go test ./circuits/... ./backend/aggregation/... ./cmd/controller/... (touched packages)
  • End-to-end compression on provertestdata2 × 3 sorted requests, all valid
  • Smoke run by devops on a staging GPU host (g7e.8xlarge or equivalent)
  • Confirm the controller on a CPU host still accepts execution/aggregation jobs
  • Confirm the controller on a GPU host rejects execution/aggregation jobs

🤖 Generated with Claude Code


Note

High Risk
High risk because it introduces a new GPU-backed proving path (gpu/plonk2 via CGO/CUDA) and refactors setup/SRS loading and PI/quotient/vortex hashing logic; mistakes could cause incorrect proofs, runtime failures, or performance regressions across critical proving flows.

Overview
Enables GPU-accelerated proving for data-availability (compression) by threading a new circuits.WithGPU option through ProveCheck, skipping Lagrange SRS loads when on GPU, and eagerly prefetching setups to reduce wall time.

Plumbs a gated GPU path for aggregation (PI → BW6 → BN254) behind LINEA_PROVER_GPU_AGGREGATION, including GPU-backed PI Vortex (MiMC + ring-SIS) and quotient coset reevaluation (CUDA-tagged implementations with CPU fallbacks).

Adds CUDA build tooling and ergonomics: bin/prover-cuda make target, CUDA typecheck in CI (go vet -tags=cuda ./gpu/...), new gpu/cuda CMake build files, and bumps the prover version/dependencies to support these changes.

Reviewed by Cursor Bugbot for commit 066520e. Bugbot is set up for automated code reviews on this repo.
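The `circuits.WithGPU` option mentioned in the overview suggests Go's functional-options pattern. A sketch of that pattern under assumed types: `proveConfig`, `ProveOption` and `proveCheck` are hypothetical, and the real signatures of `circuits.WithGPU` and `ProveCheck` are not shown in this PR description. The sketch also mirrors the stated behavior that the GPU path skips the Lagrange SRS load.

```go
package main

import "fmt"

// proveConfig is a hypothetical stand-in for the prover's option state.
type proveConfig struct {
	useGPU bool
}

// ProveOption mutates the prover configuration (functional-options pattern).
type ProveOption func(*proveConfig)

// WithGPU selects the GPU backend.
func WithGPU() ProveOption {
	return func(c *proveConfig) { c.useGPU = true }
}

// proveCheck applies the options on top of CPU defaults and reports the
// chosen backend and whether the Lagrange SRS would be loaded.
func proveCheck(opts ...ProveOption) (backend string, loadLagrangeSRS bool) {
	cfg := proveConfig{}
	for _, opt := range opts {
		opt(&cfg)
	}
	if cfg.useGPU {
		return "gpu", false // GPU path: canonical SRS only
	}
	return "cpu", true
}

func main() {
	b, lag := proveCheck()
	fmt.Println(b, lag) // cpu true
	b, lag = proveCheck(WithGPU())
	fmt.Println(b, lag) // gpu false
}
```

Adding the option at call sites leaves every existing caller on the CPU default, which is why this pattern suits a gated rollout.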

* Compression (data-availability-v2) auto-enables the gpu/plonk2 prover
  whenever a CUDA device is reachable. Wall-clock on the reference host
  drops from ~4:40 (CPU) to ~2:10 per proof.
* Aggregation GPU plumbing (gpu/plonk2 PI/BW6/BN254 + gpu/vortex PI MiMC
  and ring-SIS + gpu/quotient) is wired but disabled by default behind
  $LINEA_PROVER_GPU_AGGREGATION; leave the flag off in production for now.
* cmd/controller refuses execution / aggregation / invalidity jobs when a
  GPU is detected; only compression is accepted on a GPU host.

See prover/reference-benchmarks/README.md for the host class, build
command, runtime flags and 3-proof compression reference (avg 2:10.19 on
AWS g7e.8xlarge with an RTX PRO 6000 Blackwell).
Copilot AI review requested due to automatic review settings May 8, 2026 19:52

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

Comment thread on prover/config/config-mainnet-limitless.toml (outdated)

socket-security Bot commented May 8, 2026


codecov-commenter commented May 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.84%. Comparing base (a1a9917) to head (066520e).
⚠️ Report is 11 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##               main    #3041   +/-   ##
=========================================
  Coverage     75.84%   75.84%           
  Complexity     6844     6844           
=========================================
  Files          1121     1121           
  Lines         44508    44508           
  Branches       5355     5355           
=========================================
  Hits          33755    33755           
  Misses         9469     9469           
  Partials       1284     1284           
| Flag | Coverage Δ | *Carryforward flag |
|---|---|---|
| hardhat | 96.17% <ø> (ø) | |
| kotlin | 52.38% <ø> (ø) | Carriedforward from c379756 |
| lido-governance-monitor | 97.61% <ø> (ø) | Carriedforward from c379756 |
| linea-native-libs | 90.69% <ø> (ø) | Carriedforward from c379756 |
| linea-shared-utils | 96.18% <ø> (ø) | Carriedforward from c379756 |
| native-yield-automation-service | 97.68% <ø> (ø) | Carriedforward from c379756 |
| postman | 99.92% <ø> (ø) | Carriedforward from c379756 |
| sdk-core | 98.09% <ø> (ø) | Carriedforward from c379756 |
| sdk-ethers | 89.83% <ø> (ø) | Carriedforward from c379756 |
| sdk-viem | 99.45% <ø> (ø) | Carriedforward from c379756 |
| tracer | 88.56% <ø> (ø) | Carriedforward from c379756 |

*This pull request uses carry forward flags.


Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

socket-security Bot commented May 12, 2026

Warning

Review the following alerts detected in dependencies.

According to your organization's Security Policy, it is recommended to resolve "Warn" alerts.

| Action | Severity | Alert |
|---|---|---|
| Warn | High | Obfuscated code: golang github.com/pelletier/go-toml/v2 is 90.0% likely obfuscated |

Confidence: 0.90

Location: Package overview

From: golang/github.com/spf13/viper@v1.21.0 → golang/github.com/pelletier/go-toml/v2@v2.3.1


Next steps: Take a moment to review the security alert above. Review the linked package source code to understand the potential risk. Ensure the package is not malicious before proceeding. If you're unsure how to proceed, reach out to your security team or ask the Socket team for help at support@socket.dev.

Suggestion: Packages should not obfuscate their code. Consider not using packages with obfuscated code.

Mark the package as acceptable risk. To ignore this alert only in this pull request, reply with the comment @SocketSecurity ignore golang/github.com/pelletier/go-toml/v2@v2.3.1. You can also ignore all packages with @SocketSecurity ignore-all. To ignore an alert for all future pull requests, use Socket's Dashboard to change the triage state of this alert.


- config-mainnet-limitless.toml: restore relative paths (dev-host
  absolute paths leaked into the committed prod config).
- prover-testing.yml: run `go vet -tags=cuda ./gpu/...` in the static
  check job so CPU refactors that break GPU compilation are caught.
  vet compiles but does not link, so no CUDA toolchain needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gbotrel requested a review from gusiri May 12, 2026 14:44

cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit c379756.

gbotrel requested a review from ThomasPiellard May 12, 2026 14:59
Adds deterministic byte-level parity checks between the GPU plonk
prover's Fiat-Shamir helpers and the audited gnark CPU construction.

* TestFiatShamirChallengeParity (+ NoBsb22 variant) — replays the four
  prover challenges (gamma, beta, alpha, zeta) through the GPU's
  bindPublicData/deriveRandomness helpers and compares each derived
  fr.Element against an inline reference built directly on
  gnark-crypto's public fiat-shamir API. The reference mirrors
  gnark CPU's exact bind order from
  backend/plonk/{curve}/{prove,verify}.go.
* TestFiatShamirBatchOpenParity — exercises gpuBatchOpen's KZG-folding
  FS instance against gnark-crypto's kzg.BatchOpenSinglePoint on
  identical synthetic inputs (same polys, digests, claimed values,
  point, dataTranscript, SRS, and folding hash). When the gamma
  folding challenge matches byte-for-byte, the quotient commitment
  H is bit-identical; any FS drift yields a different H.

Generated for bn254, bls12377, bw6761 via the existing template
pipeline. All 9 tests pass locally on RTX PRO 6000 Blackwell.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
